A COMPARISON OF the two lists of words of highest frequency drawn from two sizable corpora of English can reveal some of the peculiarities or biases of the corpora. In making such comparisons,* we have discovered that the bias is much more apparent if three or four such lists are compared with each other. The bias can be detected by discrepancies in the frequencies of some words or by comparing the lists of the hundred (or more) words of highest frequency.

This paper will illustrate the method and will demonstrate that the bias of a corpus of 100,000 words or more reaches even into the 122 words of highest frequency. In view of the innumerable man-hours that have gone into piling up frequency counts from corpora numbering several millions of words in the effort to establish a bias-free frequency list, this is an important finding. It is made possible in part by the publication of the 122 words of highest frequency in the first 285,062 collected at Brown University for the Standard Corpus of Edited American English, of which the components were first put in print in 1961.

The first 285,062 words put on computer tape were the material of a doctoral dissertation by George K. Monroe.* We will call this portion of the Standard Corpus the Monroe corpus or M.

*For another example see our "Frequencies of Structure Words in the Writing of Children and Adults," Elementary English 42 (December 1965), 878-82 and 894.

Mr. Card and Mrs. McDavid are professors of English at Illinois Teachers College: Chicago-South, formerly Chicago Teachers College South. A first version of this paper was read at the first Regional Meeting of the Chicago Linguistic Society, April 4, 1965; a revised version was read at the Midwest Modern Language Association convention in Chicago May 7, 1965.

For the purposes of the dissertation it was convenient to have the computer make a frequency count. The first page of the computer printout held 122 words and their frequencies. They are displayed in Table V* of the description of the Standard Corpus given by W. Nelson Francis in College English 26 (January ...), 267-73. In Table I Francis lists the categories of publications sampled for the whole Standard Corpus and the number of 2000-word excerpts (to a total of 500) taken in each main category. We have not been able to learn from the Department of Linguistics at Brown which excerpts were included in the Monroe corpus and will be unable to account for the bias of M which our comparisons reveal.

In Table 1 we array the 122 words in a column; in column M we give the rank order of each and in column f the frequency. In three columns alongside M we give the rank order of the same words in three other frequency counts.

Column D is derived from Godfrey Dewey, Relativ[e] Frequency of English Speech Sounds, Harvard University Press, Cambridge, 1923. Like M, Dewey's corpus was a composite. It was assembled in the spring of 1918 and was chiefly but not strictly contemporary, some 10 per cent of the 107,138 running words having been composed before the twentieth century. Newspaper excerpts supplied 30 per cent of the corpus, magazines 25 per cent. Only 10 per cent was from fiction and 5 per cent from drama. Dewey lists the thousand words of highest frequency in his corpus. In column D we give the rank order in Dewey's list of the 122 cited from M.*

*Phonemic Transcription of Graphic Post-base Affixes in English: A Computer Problem, Brown University, June 1965.

*By mistake the word at order 92 is ... instead of Mrs. We have also corrected (from Monroe's thesis) the mistaken estimate of the size of M. A detailed description of the Standard Corpus is given in the Manual of Information, which can be obtained from the Department of Linguistics at Brown.

Column U shows the rank order of the same words as given in Miles L. Hanley, Word Index to James Joyce's Ulysses, Madison, Wisconsin, 1937 [mimeographed],* which notes the frequency of the 260,430 words, name initials, etc. in Ulysses, which Joyce composed in the years 1914-21.

Column R furnishes for the same words the rank order derived from Henry D. Rinsland, A Basic Vocabulary of Elementary School Children, Macmillan, New York, 1945, which lists alphabetically with their frequencies the 14,571 words occurring three or more times in any one grade from the first to the eighth in a corpus of 6,012,359 running words of schoolchildren's writing (including 4630 pages of recorded conversation of children in first grade) collected in the spring of 1937 from schools in all parts of the country.

For the reader's ease we carry out the exposition in this paper in terms of rank order rather than raw frequency: the numbers are smaller and easier to follow. In making a statistical check of the judgments in section II, we took the raw frequencies, adjusted them to equate with a corpus of 285,000 words, and ran a chi-square test. Except for those otherwise labeled, all the statements of significant differences in rank orders in the next section are valid for the raw frequencies well within the .001 level of confidence, which is to say that such differences in frequency (and hence rank order) would not turn up as often as once in a thousand times by mere chance. The excepted data are labeled for the .01, .02, or .05 levels of confidence, which relate respectively to a chance occurrence once in a hundred, once in fifty, or once in twenty times.*

*Dewey dropped abbreviations from his list. We have inserted Mr. and Mrs. at their proper places and adjusted the rank orders under D to fit them.

*In Appendix II of this volume Martin Joos gives the rank order in U of the first 100 words in Dewey's list. This suggested to us that it would be interesting to add M to the comparison. A former colleague, Dr. AMa RauUn, suggested that we add the Rinsland frequencies.
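The check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the paper does not spell out the layout of its chi-square test, so the 2x2 design below (one word against all other words, in two corpora of equal adjusted size) is our guess, the function names are hypothetical, and any counts passed in would have to come from the reader's own data. The critical values are the standard ones for one degree of freedom.

```python
# Illustrative sketch; the 2x2 design is an assumption, not the authors'
# documented procedure.

TARGET_SIZE = 285_000  # size to which every corpus is equated

# Upper-tail critical values of chi-square with 1 degree of freedom.
CRITICAL_DF1 = {0.05: 3.841, 0.02: 5.412, 0.01: 6.635, 0.001: 10.828}


def scale_frequency(raw_count, corpus_size, target_size=TARGET_SIZE):
    """Rescale a raw count to what it would be in a corpus of target_size words."""
    return raw_count * target_size / corpus_size


def chi_square_2x2(count_a, count_b, size=TARGET_SIZE):
    """Pearson chi-square (no continuity correction) for one word in two
    corpora of equal size: rows = corpus A/B, columns = the word / all other words."""
    observed = [[count_a, size - count_a], [count_b, size - count_b]]
    col_totals = [count_a + count_b, 2 * size - count_a - count_b]
    grand_total = 2 * size
    stat = 0.0
    for row in observed:
        for obs, col_total in zip(row, col_totals):
            expected = size * col_total / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


def strictest_level(stat):
    """Return the smallest listed significance level that stat reaches, or None."""
    passed = [level for level, crit in CRITICAL_DF1.items() if stat >= crit]
    return min(passed) if passed else None
```

With this layout, a word counted 400 times in one adjusted corpus and 300 times in another clears the .001 threshold, while identical counts give a statistic of zero.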
Table 1 (cont'd)
II

The most common words are the ones whose rank order in a frequency list is most firmly fixed for a corpus of a given kind. Yet bias may show itself in the first few ranks if the corpus differs sufficiently from the norm. We note that M and D agree on the ranking of the first six words; U agrees to the list but reverses the rank of a and to; but R agrees to only four of the list, in and especially of dropping down to positions lower in order. With frequencies of the magnitude of those occurring in the first six ranks, the chance that the shifts in U and R are a random effect is a great deal less than one in a thousand.

M and D agree on 35 of the first 36 words, as indicated by the fact that only one number of the first 36 under D is higher than 36. The word not agreed on is I, which stands at order 10 in D and U and at order 2 in R but is not included in the 122 of M. Under U, 10 of the first 36 numbers are higher than 36, and under R, 12. Further study confirms the first impression that D is closer to M than the other two lists are, that R is farthest from M, and that U is between R and D but somewhat closer to R.
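The counting rule used in this paragraph (how many of the first 36 numbers in another column are themselves no higher than 36) is easy to mechanize. The sketch below is a hypothetical helper, not part of the original study; the rank dictionaries a reader supplies would be transcribed from Table 1, and the tiny ranks in the usage note are invented for illustration.

```python
def agreements_in_top(ranks_m, ranks_other, n=36):
    """Count the words ranked 1..n in M whose rank in the other list is also <= n.

    Both arguments map a word to its rank order. A word missing from the
    other list is treated as falling outside its first n (a disagreement).
    Returns (number agreed, number of M words considered).
    """
    top_m = [word for word, rank in ranks_m.items() if rank <= n]
    agreed = [word for word in top_m if ranks_other.get(word, n + 1) <= n]
    return len(agreed), len(top_m)


# Invented ranks, for illustration only (not taken from Table 1).
toy_m = {"the": 1, "of": 2, "and": 3, "to": 4}
toy_d = {"the": 1, "of": 2, "and": 5, "to": 3}
toy_result = agreements_in_top(toy_m, toy_d, n=3)
```

Here `toy_result` is `(2, 3)`: of the three words heading the first toy list, two also fall within the first three of the second.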

The farther down the list one goes, the greater the discrepancies in the rank orders. The point of greatest divergence happens to be at the word state, order 100 in M and 282 in D. The discrepancy of 182 rank orders is greater (as between M and D) than for any other word on the list and approached only by use at rank 115. The rank order of state in R and U is also very much lower than in M. The second most extreme discrepancy between U and M is also for the word use. Evidently the frequencies of state and use are abnormally high in M.

Other words of markedly higher frequency in M than in the other lists are new 44, years 84, year 99, used 122, also 75, even 86, last 83, each 96, and Mrs. 92. For Mrs. the Dewey frequency is undoubtedly rather low: only the 15 per cent of fiction and drama in his corpus would be likely to heighten the frequency of the title. The Standard Corpus includes 5 excerpts from women's magazines and 3 from newspaper society pages; the fiction comprises 25 per cent of the whole corpus. A disproportionate amount of this matter must have got into the Monroe corpus. We predict that the frequency list of the Standard Corpus will place Mrs. below Mr. but not as low as the Dewey order of 236. The other words listed in this paragraph will also probably have a lower relative frequency in the count of the Standard Corpus.

Not so widely divergent from the other orders but still of notably higher order in M are the determiners some 56 and (at the .02 level) these 58; the numerative adjectivals more 41, first 66, most 70, and (at the .01 level) two 67 and (at .02) other 54; the modals can 42 and may 55 (but only at .01). The words of notably low frequency in M are your 68, them 71, my 98, and (at .02) how 104. Whether the frequencies of all these words are the particular bias of M or of the Standard Corpus itself will appear when the frequency list of the Standard Corpus is published.

Reserving discussion of orders higher than the 122nd for the next section, we turn now to note some of the peculiarities of the other corpora. As we list words we will give their rank order in the M column to make it easier for the reader to find the words.

A remarkable peculiarity of Ulysses is the relatively low frequency of several verbs. Half of them are also low in the children's writing: be 15, has 30, should 91, and we may add must 87 (but only at the .05 level in U) and were 37 (but only at .05 in R). Those low in U but unremarkable in R include is 7, are 19, have 23, will 29, and would 39. Presumably the relatively low frequency of is and are is due to the heavier use which the largely expository corpora make of the timeless present in generalizations. The fact that was 13 stands at virtually identical ranks in all the corpora seems to confirm this supposition. The same may be said of the low frequencies of have and has as compared to the nearly identical frequency of had 34.

On the other hand, fiction has more need for said 51 than exposition has. Hence the rank orders of 72 and 75 in R and D, which are predominantly expository, and the order 23 in U. The fact that M is midway between the extremes suggests once more that it contains a disproportionately large share of the fiction of the Standard Corpus. We must come back to Joyce's verb system in the next section and will postpone further comment.

The fact that the children use have 23 and had 34 more frequently than the other corpora suggests that they need it in the sense of "possess" more often than adults. The reason why has 30 is comparatively low in frequency in their vocabulary is that they are more concerned with what "I and you" possess than with what "he, she, and it" do. This can be seen in the high frequency of the possessive determiners my 98 and your 68 and the low frequency of his 21 and its 48. While her 61 is at about the same rank level as in the adult writing, she 73 stands higher on the children's list than on the adult; this means that the children need her as the object pronoun oftener than the adults do and hence use it less often as a determiner. Both their 38 and our 65 are also low in frequency in R as compared to they 32 and we 35.

In R the orders of is 7, was 13, are 19, will 29, can 42, and do 72 are close to those of D and M. The orders of be 15, were 37, been 43, may 55, could 80, should 91, and being 118, and (at .05) must 87 are lower in rank than in M and D. The lower frequencies of be and been are partly a reflection of the lower frequencies of the modals and partly, along with the lower frequency of being, a reflection of the lesser use made by children of the more complex verb phrases. Children may not need the moral imperatives of should and must as often as adults, or they may substitute have to and has to. The logical use of must ("and it must follow as the night the day") and the probabilistic senses of may will appear less often in the immature style of children. If we add the instances of couldn't in the Rinsland corpus to the instances of could, the total would stand at the 84th rank, which is just between D and M. If we combine the same two words in D, they stand at the 82nd rank. The difference between the adult and children's use of the words is merely the greater frequency of the informal couldn't in the children's writing.

There are other words than verbs for which U and R share a difference from M and D. The lower frequency of its 48 reflects the more personal world of discourse of the novel and of the children. This is also the reason for the higher frequency of she 73. Though the world of children may be slightly less masculine than the others, the world of U is not: he 16, him 59, you 33 (also high in D and higher in R), her 61, your 68, and my 98 are all of comparatively high frequency in U; only we 35 and our 65 are comparatively low.

The greater concreteness of the children's and the novelist's worlds is suggested by the higher frequencies of the adverbs or prepositions of location and time: out 60, up 64, and then 79 (and for U alone we may add over 74, after 81, and now 82). We can discern differences of rhetoric in the lower frequencies of for 9, which 27, since 120, own 119 (but only at .01 in U), and for U alone, but ...; these words are not needed as much in narrative as in exposition, and they require a somewhat more complex style than all the children have mastered. (Of similar character is U's preference of too 117 over also 75.) R and U agree in giving like 103 a higher rank order. As a verb like would appear more frequently in personal than impersonal discourse; as a preposition of comparison it would be preferred to such as by both Joyce and the children. In her sleepward soliloquy, Molly Bloom's preposition is like, never such as. The lower frequency of any 77 suggests a lower frequency of generalizations in R and U.

The personal character of the children's world of discourse is also reflected in the strikingly low rank of of, which is tenth in R and second in the other corpora. In a world in which personal nouns and proper names are specially frequent, the possessive inflection is more frequent and the periphrastic genitive with of less so.

Other prepositions of comparatively low rank in R are with 11, by 17, and from 22. The discrepancy is greatest for by, the agentive preposition in passive sentences, which do not occur so frequently in children's as in adult writing. Of course by is also an adverb, but as such it has fewer meanings and enters into fewer idioms than do to 4, in 6, and on 14, which are in about the same rank in R as in the other corpora. Evidently in the syntax of children's writing even the simple prepositional phrase is less frequent than in adult writing.

In R an 28 is much lower in rank than elsewhere. The explanation seems to us to be partly the regional distribution of a before vowels and partly the age at which children master the alternation of a and an (if they ever do). The use of a before vowels is most common in the South Midland and South, where it is a nonstandard and receding feature, and in large urban centers. Since Rinsland was careful to draw upon all parts of the country and all kinds of schools for his corpus, these areas are well represented in R. Disadvantaged youths of college age sometimes substitute a for an in writing. We have also noticed that middle class children of well-educated parents have not fully mastered the use of an by age eight or nine, perhaps because it is the lone survivor of what was once a more extensive pattern in which mine, thine, and none also occurred before vowels and my, thy, and no before consonants.

We will mention that R substantiates the complaints of teachers that pupils overuse very 111 and begin too many sentences with there 45. But other points that might be made in explanation of the rank orders in R and U we will have to leave to the reader to make for himself.

III

Another way of establishing the bias of a corpus is to compare its most frequent words with those of other corpora. Of the first 122 in M, 76 words are common to the first 122 of all four corpora. (These 76 can be derived from Table 1 by noting which words have no number higher than 122 in the columns R, U, and D.) How the four lists of 122 words differ can be seen in Table 2, which lists the words uniquely present or absent in each corpus. (That is, under +R are listed the 24 words that appear in the 122 most frequent in R but not in the first 122 of any of the other three lists. Following the word is its rank order in R. Under -R appear the words that are in the first 122 of U, D, and M but not in the first 122 of R.)
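The three kinds of list described here (words common to all corpora, words uniquely present in one, and words uniquely absent from one) correspond to simple set operations. The following is a minimal sketch under our own assumptions: the function name is hypothetical, and the four small word sets are invented stand-ins for the real lists of 122 words.

```python
def compare_top_lists(tops):
    """tops maps a corpus label to the set of its most frequent words.

    Returns (common, unique_present, unique_absent), where
    unique_present[label] holds words found only in that corpus's list and
    unique_absent[label] holds words found in every list except that one.
    """
    labels = list(tops)
    common = set.intersection(*tops.values())
    unique_present, unique_absent = {}, {}
    for label in labels:
        others = [tops[other] for other in labels if other != label]
        unique_present[label] = tops[label] - set.union(*others)
        unique_absent[label] = set.intersection(*others) - tops[label]
    return common, unique_present, unique_absent


# Invented miniature lists, for illustration only.
toy_tops = {
    "M": {"the", "of", "state", "new"},
    "D": {"the", "of", "war", "new", "I"},
    "U": {"the", "of", "yes", "I"},
    "R": {"the", "of", "school", "I"},
}
toy_common, toy_plus, toy_minus = compare_top_lists(toy_tops)
```

In the toy data, `toy_plus["R"]` is `{"school"}` (the analogue of +R) and `toy_minus["M"]` is `{"I"}` (the analogue of -M), while a word shared by exactly two lists, such as "new", falls in neither and would belong to a pairwise table.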

Table 2

+R
school 26, am 29, got 34, went 36, dear 44, going 47, mother 64, friend 66, home 74, Christmas 79, write 81, play 83, came 84, put 85, house 87, saw 91, dog 102, name 107, want 110, soon 117, take 118, letter 120, sure 121, boy 122

-R
which 134, more 136, who 141, way 154, before 158, no 163, only 195, must 231, under 293, Mr. 301, its 312, those 491

+U
Bloom 30, Stephen 56, says 61, off 76, yes 77, eyes 80, O 81, hand 90, street 93, again 109*, face 111, right 113, round 115, head 119*

-U
been 126, can 128, made 145, make 157, much 162*, many 323, people 354

+D
war 55, men 72, great 85, upon 89, every 90, shall 94

-D
just 129, too 183

+M
also 75, last 83, years 84, Mrs. 92, each 96, year 99, state 100, work 114, use 115, three 116, own 119, since 120, still 121, used 122

-M
(R rank first, U second, D third)
come 67, 97, 92; did 75, 71, 111; down 78, 62, 120; here 94, ..., 105; I 2, 10, 10; know 109, 82, 115; little 54, 92, ...; me 27, 29, 47; see 65, 67, 113; us 88, 102*, 93

Other results of the comparison are presented in Table 3, which arrays the words common to any pair of lists and absent in the other pair. (Under +DM are the words that appear in the 122 most common in the Dewey and Monroe corpora but not in the 122 most common in Ulysses and the Rinsland corpus.) The three tables furnish the reader all the information he needs to derive and order the three lists of 122 words other than M, which is already given in Table 1.

Table 3

+DM
(D rank first, M second)
new 120, 44; than 66, 49; may 16, 55; these 81, 58; most 109, 70; such 88, 76; any 57, 77; even 122, 86; world 115, 88; should 96, 91; being 103, 118

+UM
(U rank first, M second)
through 106, 95; where 89, 107

+RU
(R rank first, U second)
go 41, 114; get 50, 101; back 98, 75; night 106, 112; old 106, 58

+UD
(U rank first, D second)
say 100, 100; never 104, 117; long 116, 122

+RD, +RM
No such words.

To take the most obvious example 
first, it is easy to discern the bias of R 
and to account for it. The 140 or so 
adults who wrote M discussed so many 
different topics that, as we analyze in 
Table 4 the 122 most frequent words, 
only 20 of them are content words. The 
100,000 and more children who wrote 
the Rinsland corpus lived in a much
smaller world of discourse. 

The words under +R of Table 2 describe this world: it has a school, a mother, a friend, a house, a home, Christmas, a dog, a name, a letter beginning with the word dear, and a boy. Its inhabitants write, play, put, want, take; they got, they came, they went, they saw . . . soon and sure.

The structure word am appears for use with I, which in the uninhibited writing of children is the word of second highest frequency (see last section of Table 2). In -R appear the title Mr. and the ten or eleven (if we count way) structure words crowded off the list by the content words of +R. It is interesting to note that children do not need more and most (+DM, Table 3) as often as adults, which suggests that they make heavier use of inflected adjectives. It is also interesting that they share with U a greater need for old (+RU) and a lesser need for new (+DM). On the whole the excluded words in -R and +DM suggest the more oral and simpler syntax of children's writing.

In the list under +U, Bloom and Stephen are obviously a bias of the novel. The Dublin mise en scène contributes street. Eyes, hand, face, head come partly from the particularity of fiction, partly from the highly personal nature of Joyce's narrative. The word says occurs 470 times in Ulysses, but at least 3.. instances are in a 52-page chapter beginning, "I was just passing the time of day with old Troy" (Random House ed., pp. 287-339), where nearly every paragraph of conversation has the word says. Its appearance in +U is the result of the stylistic bias of this chapter. Also stylistic are yes and O. Molly Bloom is a highly affirmative woman: in the last chapter, her nocturnal soliloquy, she says yes at least 70 times and O at least 48.

Taking as his base the first hundred words in Dewey's frequency list, Martin Joos observed thirty years ago in Appendix II of Hanley's Word Index (p. 386) that Joyce was not very fond of the words may, should, shall, and these. From -U we can add been, can, make and made, and also much and many. From +DM we can add being to the other verbs, and from Table 1 we can add this 18 to these. If we add to the verbs in this paragraph the 9 others of comparatively low frequency cited in the previous section, it is apparent that the system of verb phrases in Ulysses differs markedly from that of the Dewey and Monroe corpora. How much of this to attribute to the differences in genre, how much to the difference between Irish English and American English, how much to the style of Joyce, and how much to the style of Ulysses are interesting questions that only another computer program could answer.

We now go on to +D. Dewey himself noted that war was unusually frequent because his corpus was collected in the spring of 1918, about half of it from newspapers and magazines. This










ENGLISH WORDS OF VERY HIGH FREQUENCY 603 



accounts for the presence of great (as in the phrases "great war, great powers, great losses") and for men (as in discussions of military actions and losses). Some features of the corpus would exaggerate the frequency of shall: 5000 words of editorials from the Boston Evening Transcript, 5000 from Abraham Lincoln's speeches, 2000 from the sermons of Henry Ward Beecher and Phillips Brooks, etc. In our opinion, an additional reason for its absence from the other lists is a diminishment in its use in American English since 1918. Whether the presence of upon at rank 89 and the determiner every at rank 90 is due to a diachronic shift in the language or to the bias of the Dewey corpus we cannot say. We may know more about this when the frequency list for the whole Standard Corpus is published.

The absence of too (-D) is surely a result of bias in the corpus. But just at order 129 is almost on the list: if there had been more colloquial matter in the corpus, it would have been included.

As to the general bias of M, we note that there are 14 words in +M and only 6 in +D. Similarly there are 10 words in -M and only 2 in -D. Judged by this criterion as well as by the number of words that are of comparatively high frequency in M alone, M would appear to be further from the baseline of American English in respect to high-frequency words than D.



Having shown that the bias of a corpus is reflected even in its 122 most common words (though not much in the first 35 unless the corpus is quite peculiar) and having shown that some aspects of the bias may be detected by comparison with other frequency lists, we will proceed to hazard a few guesses about the 122 most frequent words of the Standard Corpus. We surmise that when the list is published the words come, I, little, me, and us will appear, and that the words Mrs., use and used will not appear.

As to the more disputable territory of each, last, years, and own as against did, down, here, know, and see, we think it likely that more of the last five will appear than of the first four.

IV 

In Table 4 we array the 122 words of
M in a grammatical classification to de- 
termine what per cent of the corpus is 
taken up by the several subclasses. To 
keep our groups from getting too small 
and too numerous we have had to make 
compromises and approximations. Other 
grammarians would have made other 
decisions: anyone who does not like our 
grouping can make his own from Table 1. 

Table 4

Per cent
of corpus

10.7   A.   9 determiners: the, a, an, their, its, no, our, your, my
 3.?   B.  11 determiners/pronominals: that, this, his, all, some, these, her, any, those, each, much
 6.?   C.   7 prepositions: of, for, with, at, from, into, like
 6.?   D.  10 prepositional adverbials: to, in, on, by, out, about, up, over, through, under
 1.?   E.   9 adverbials: only, also, then, now, even, just, very, too, still
 2.7   F.   8 personal pronouns: it, he, they, you, we, him, them, she
 3.5   G.   3 correlatives: and, or, but
 1.?   H.   5 relatives/interrogatives: which, who, what, how, where
 1.7   I.  11 numerative adjectivals: one, more, other, first, two, most, such, many, last, three, own
 0.8   J.   6 subordinators: if, when, after, because, before, since
 4.7   K.  11 auxiliaries: is, was, be, are, have, has, had, were, been, do, being
 1.?   L.   7 modals: will, would, can, may, could, must, should
 1.7   M.   5 miscellaneous: 's, not, there, than, so

45.8      102 structure words

 2.1       20 content words: new, said, time, years, made, world, good, man, Mrs., Mr., year, state, people, way, make, well, day, work, use, used

We define determiners strictly so that there is only one prenominal determiner to a noun phrase (including the zero allomorph of a/an); that is, determiners displace each other. Then the numerative adjectivals are the words that can occupy the slot between the determiner and the first veritable descriptive adjective, for example, two and other in the phrase "the two other friendly fellows." Except for the cardinal and ordinal numerals, which are theoretically infinite, this is a small, closed class. A few of our other groups are more of a mixture, some of their members having more than one or two grammatical functions.

The line between content words and structure words is admittedly a blurry one. The auxiliary have belongs on one side, the lexical have meaning "possess, etc." belongs on the other. We have assumed that the uses as an auxiliary in M outnumber the others. For way, on the other hand, we assumed that its content uses outnumbered its structural ones, as we did also for used. Our use of the term "content" in this sense is not intended to imply that structure words do not have meaning or content.

As we classify them, the determiners, the determiners that are also pronominals, the prepositions, and the prepositions that are also adverbials account for over a quarter of the whole corpus. Add to these the modals and auxiliaries, and 55 structure words account for exactly a third of the corpus. The 20 content words appear 5862 times, the 102 structure words 130,505 times. Since the 102 are less than a fifth of the structural vocabulary of English, it appears that more than half the words in a typical composite corpus will be structural.*
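The two totals just quoted can be turned back into shares of the corpus as a quick consistency check. The sketch below is our own illustration, not the authors' computation; it assumes that the 285,062 running words given earlier for M are the right denominator, and the constant and function names are hypothetical.

```python
CORPUS_SIZE = 285_062       # running words in M, as stated earlier in the paper
CONTENT_TOKENS = 5_862      # occurrences of the 20 content words
STRUCTURE_TOKENS = 130_505  # occurrences of the 102 structure words


def percent_of_corpus(tokens, corpus_size=CORPUS_SIZE):
    """Share of the corpus taken up by the given number of tokens, to one decimal."""
    return round(100 * tokens / corpus_size, 1)


content_share = percent_of_corpus(CONTENT_TOKENS)
structure_share = percent_of_corpus(STRUCTURE_TOKENS)
top_122_share = percent_of_corpus(CONTENT_TOKENS + STRUCTURE_TOKENS)
```

On that assumption the content words come to 2.1 per cent and the structure words to 45.8 per cent, so the 122 words together cover a little under half of M.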

The last lines of Table 4 call to our attention the fact that time is the noun of highest frequency in M, as was Zeit in a very large German corpus counted by F. W. Kaeding at the end of the last century.* We take it this signals common elements in western culture rather than the Germanic brotherhood of the languages. After war, time is the most frequent noun in D. In U, after Bloom and Stephen the first common noun is man, followed by time. In R the first is school, followed by time, and then by mother. Evidently the bias of a corpus can sometimes be discovered in its first noun alone.

One further point of interest: of the 122 words, only 12 are not of OE origin. They, their, them are early ME borrowings from ON. Mr. and because are respectively fusion and compound of OE and OF or Latin elements; and Mrs., state, people, just, very, use, and used are of OF or Latin origin. All but Mr. (15th c.) and Mrs. (16th) are ME borrowings. All told, they occur 4172 times in the corpus (1.5 per cent of it) as against 132,195 occurrences of the native words (46.4 per cent of the corpus), which occur almost 32 times as often as the borrowed ones. Not only do structure words bulk large in a typical corpus; they are almost exclusively native English words.
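The figures in this paragraph can be checked against one another and against the totals given for Table 4. The following is a small sketch of that arithmetic, offered as an illustration only; it assumes the corpus size of 285,062 words stated at the outset, and the variable names are ours.

```python
CORPUS_SIZE = 285_062      # running words in M, as stated earlier in the paper
BORROWED_TOKENS = 4_172    # occurrences of the 12 words not of OE origin
NATIVE_TOKENS = 132_195    # occurrences of the 110 native words

# Totals reported for the 20 content words and the 102 structure words.
CONTENT_TOKENS = 5_862
STRUCTURE_TOKENS = 130_505

borrowed_share = round(100 * BORROWED_TOKENS / CORPUS_SIZE, 1)
native_share = round(100 * NATIVE_TOKENS / CORPUS_SIZE, 1)
native_to_borrowed = round(NATIVE_TOKENS / BORROWED_TOKENS, 1)

# Both breakdowns cover the same 122 words, so their token totals should match.
totals_agree = (BORROWED_TOKENS + NATIVE_TOKENS) == (CONTENT_TOKENS + STRUCTURE_TOKENS)
```

The shares come out at 1.5 and 46.4 per cent, the ratio at 31.7 ("almost 32"), and the native-plus-borrowed total equals the content-plus-structure total of 136,367.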

*Ernest Horn, A Basic Writing Vocabulary, University of Iowa Monographs in Education, 1926, p. 1?9.